[rollout] chore: bump up trtllm image version to 1.3.0rc10#5841
Superjomn wants to merge 23 commits into verl-project:main
Code Review
This pull request updates the TRT-LLM rollout implementation and Docker environment, including upgrading Megatron-LM to v0.16.0 and transitioning to SleepConfig for server management. It also removes the single-node restriction for TRT-LLM replicas. Review feedback points out that using a branch name for the DeepEP dependency in the Dockerfile compromises build reproducibility and identifies a potential IndexError in the placement group indexing logic that requires a bounds check.
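The bounds check requested in the review could look roughly like the sketch below. It is not the code from this PR: `pick_placement_group`, `pgs`, and `node_rank` are hypothetical names chosen for illustration, and the real indexing logic may differ.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def pick_placement_group(pgs: Sequence[T], node_rank: int) -> T:
    """Return pgs[node_rank], failing with a descriptive error when out of range.

    Hypothetical helper: the names are illustrative and not taken from verl.
    """
    # Reject negative ranks too, since Python would silently wrap them around.
    if not 0 <= node_rank < len(pgs):
        raise ValueError(
            f"node_rank={node_rank} is out of range for {len(pgs)} placement group(s)"
        )
    return pgs[node_rank]
```

Raising a `ValueError` that names both the index and the list size tends to be easier to debug in a multi-node run than a bare `IndexError`.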
Should we also bump the CI image?
hchings left a comment
Please address the cherry-pick comment. Other parts LGTM.
440c5f3 to 029b394
"model_extra",
"executor_extra",
"model",
"model_weights",
@Superjomn Is this section for backward compatibility (for older trtllm versions, before the fine-grained labels existed)? If so, it should match the old tags exactly, as at https://github.com/verl-project/verl/pull/5841/changes#diff-4d19b99d5dc8054a16c391ce00301671727c4c3549ecb6d904d33c2aa1f552beL263 (i.e., without model_weights and draft_model_weights). Otherwise I think it'll error out.
Sure, let me update it.
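A minimal sketch of the fallback pattern discussed in this thread, under stated assumptions: the module path passed to `resolve_weights_tags` is a placeholder (the real import location of `ExecutorMemoryType` is not shown here), deriving tags from the enum's member values is an assumption, and `LEGACY_WEIGHTS_TAGS` contains only the legacy tags visible in the diff excerpt, so the actual list may be longer.

```python
import importlib

# Legacy tags visible in the diff excerpt above; the real hard-coded list may
# contain more entries. It deliberately omits "model_weights" and
# "draft_model_weights", which older trtllm versions do not recognize.
LEGACY_WEIGHTS_TAGS = ["model_extra", "executor_extra", "model"]


def resolve_weights_tags(module_name: str, enum_name: str = "ExecutorMemoryType") -> list[str]:
    """Use fine-grained tags when the enum exists, else fall back to legacy tags.

    `module_name` is a placeholder for wherever the enum lives; mapping enum
    member values to tags is an assumption made for this sketch.
    """
    try:
        enum_cls = getattr(importlib.import_module(module_name), enum_name)
    except (ImportError, AttributeError):
        # Older versions: degrade gracefully to the original hard-coded list.
        return list(LEGACY_WEIGHTS_TAGS)
    return [str(member.value) for member in enum_cls]
```

Keeping the fallback list identical to the pre-change list is what prevents older versions from receiving tags they do not know about.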
Update DeepEP branch from v1.2.1 to hybrid-ep (removing the now-unnecessary patch) and add CCCL CPATH for build compatibility.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
cleanup
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d import
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ExecutorMemoryType is not yet available in trtllm v1.3.0rc10. Use try/except fallback so sleep mode gracefully degrades on older versions.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The TRT-LLM 1.3.0rc10 base image ships with a newer CUDA toolkit that requires a newer NVIDIA driver than the CI runners have, causing "No CUDA GPUs are available" errors. Adding the cuda-compat package and setting LD_LIBRARY_PATH enables the container to run on hosts with older drivers (>= R535).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…allback _WEIGHTS_TAGS
The fallback else branch (for older trtllm without ExecutorMemoryType) should match the original hard-coded list, which never included these tags.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…m Dockerfile
The cuda-compat fix does not address the actual CI failure (FlashInfer check_cuda_arch() error). Remove the unneeded apt install and LD_LIBRARY_PATH override added in 029b394.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
d475420 to 6e88b33
…detection
Co-Authored-By: Claude <noreply@anthropic.com>
Co-Authored-By: Claude <noreply@anthropic.com>
The Volcengine runner API (vemlp-github-runner@v1) can only pull images from the Volcengine registry. Using the DockerHub image caused a jq parse failure (exit code 5) in the setup job.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ibility
FlashInfer's check_cuda_arch() fails when SM 7.0 (Volta) is included. Set explicit arch list (7.5;8.0;8.9;9.0;10.0;12.0+PTX) on all 4 GPU jobs to match CI L20 runners (SM 8.9).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
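As a sanity check on the arch list quoted in this commit message, the sketch below parses the string and confirms that SM 7.0 is absent and SM 8.9 is present. It only inspects the string. The environment variable that carries the list is not named in the commit message, so none is assumed here, and `parse_arch_list` is a hypothetical helper, not part of FlashInfer or verl.

```python
def parse_arch_list(arch_list: str) -> list[tuple[int, int]]:
    """Parse a ';'-separated CUDA arch list such as '8.0;8.9;9.0+PTX'.

    Hypothetical helper for illustration; a '+PTX' suffix is stripped.
    """
    archs = []
    for entry in arch_list.split(";"):
        entry = entry.strip()
        if not entry:
            continue
        major, minor = entry.removesuffix("+PTX").split(".")
        archs.append((int(major), int(minor)))
    return archs


# The explicit list from the commit message.
ARCH_LIST = "7.5;8.0;8.9;9.0;10.0;12.0+PTX"
archs = parse_arch_list(ARCH_LIST)

has_volta = (7, 0) in archs   # SM 7.0 is what the commit says breaks the check
covers_l20 = (8, 9) in archs  # the CI L20 runners are SM 8.9
```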
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
What does this PR do?
This PR bumps the TRT-LLM Docker image to v1.3.0rc10.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
Separate multiple modules with commas, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
For a breaking change, add [BREAKING] to the beginning of the title.
Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
… ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
… recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.